## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
The dataset consists of 1599 observations of 13 variables. Variable X appears to be just an id, which leaves 12 variables of interest. Quality of wines is measured by integers between 3 and 8. I suppose the full scale is from 1 to 10, but for some reason extreme values have not been used. Other variables are continuous measures of physical properties of the wine.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
##
## 3 4 5 6 7 8
## 0.6 3.3 42.6 39.9 12.4 1.1
Distribution of wine qualities is bell-shaped with median 6 and mean 5.636. The left tail appears longer, but the right tail is heavier. It might make sense to combine categories, as some of them have only a few observations.
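As a quick check of the figures above, the counts from the table can be turned back into a vector of scores. This is only a sketch: the names `quality_counts`, `counts` and `shares` are made up for illustration, and the counts are copied from the output rather than read from the data file.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild one score per wine so that the usual summary functions apply.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

counts <- table(quality)                       # absolute frequencies
shares <- round(100 * prop.table(counts), 1)   # percentages, one decimal

print(counts)
print(shares)
print(c(median = median(quality), mean = mean(quality)))
```

With the real data frame the same calls would simply be applied to the quality column.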
##
## Low High
## 1382 217
It might be easier to work with only two categories of wines instead of the full range of evaluations. The buyers are likely more interested in whether a wine is worth buying or not instead of exact ratings.
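One possible way to build the two-level variable is `cut()`. This is a sketch under assumptions: the cut point (scores 7 and 8 count as High) matches the counts printed above, but the original code is not shown, and the scores are rebuilt here from the frequency table rather than taken from the data file.

```r
# Rebuild the quality scores from the frequency table (stand-in for the data).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Scores up to 6 become "Low", scores 7 and 8 become "High".
# cut() uses right-closed intervals by default: (-Inf, 6] and (6, Inf].
quality.bin <- cut(quality, breaks = c(-Inf, 6, Inf), labels = c("Low", "High"))

print(table(quality.bin))
```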
Fixed acidity of the wines is concentrated around the value 8, with some skew to the right. It will be interesting to see whether the best wines have the highest acidity. In that case they would be easy to identify.
Volatile acidity is much lower than fixed acidity in absolute terms. The distribution appears to have low variance with a few outliers to the right.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
Increasing the resolution reveals an interesting chasm in the middle of the distribution. Why is this?
Many wines have zero or very little citric acid. Otherwise the distribution is quite flat until it starts to decrease around 0.5. There is a curious spike at this value, and a couple less distinct ones at lower values. It looks as if the wine makers might be aiming their wines to have the amount of citric acid either zero, 0.25 or 0.5. Maybe these spikes indicate different types of wines?
Most wines have a low amount of residual sugar, between about 1 and 3.5. Some examples have much higher amounts of residual sugar. They are perhaps of a different type, like dessert wines? It will be interesting to see whether the outliers are of high or low quality.
Distribution of chlorides resembles the one of residual sugar. Most values are tightly concentrated around 0.08 with a thin and long right tail all the way to 0.6. It looks as if there is a small concentration of wines around 0.4. Is this a distinct subtype or category of wines, or just an artefact in the data? Again, I’m interested to see if this group sticks out as having low or high quality.
Most wines have lowish amounts of free sulfur dioxide. The distribution is again right-skewed.
Same story here as with free sulfur dioxide, but with values about an order of magnitude higher. I wonder what the relationship between free and total amounts of sulfur dioxide is?
There is a slightly increasing trend in additional sulfur dioxide when the amount of free sulfur dioxide increases. It is still quite common for most of the total sulfur dioxide to be accounted for by free sulfur dioxide. I’m creating a new variable fixed.sulfur.dioxide by calculating the difference between the two measures.
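The derived variable is just a column difference. A minimal sketch using the six rows printed in the `head()` output above (the data frame name `wine_head` is hypothetical):

```r
# First six rows of the two sulfur dioxide columns, copied from the output above.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Fixed (bound) sulfur dioxide is whatever is not free.
wine_head$fixed.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

print(wine_head)
```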
Similar distribution as with total sulfur dioxide.
Density is almost normally distributed around a little less than 1, which makes sense as wine is mostly water, and alcohol is less dense than water. Density might actually be correlated with the amount of alcohol.
There indeed is a downward trend with increasing alcohol levels. Stronger wines tend to be less dense.
pH of the wines is almost normally distributed around 3.3. Wines are acidic.
A relatively tight distribution with some skew and outliers to the right.
Wines typically have at least 9 % alcohol, with around 10 % being the average and the number of wines slowly decreasing as the alcohol content increases. Wine makers seem to prefer round numbers in alcohol content. There are spikes in the distribution around every .0 and .5.
The red wine dataset consists of 1599 observations of 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Quality is an ordered categorical variable on a scale from 3 to 8, larger values being better. Other variables are continuous.
Most wines (82.5 %) have a quality rating of 5 or 6. 7 is the third most common rating (12.4 %) while all the other quality scores cover only 5 % of the wines. Red wine is acidic (pH 2.7-4.0) and usually has only little residual sugar. Mean alcohol content of wines is 10.4 %.
The main feature of interest is quality. I’d like to be able to classify wines to high (quality 7 or 8) and low quality (quality 6 or lower) categories based on some combination of physical measures.
Based on the shapes of distributions, volatile acidity, citric acid, chlorides and alcohol seem promising. Especially alcohol and citric acid distributions feature curious spikes at round values, suggesting the wine makers might be aiming for specific values of these features, which implies they believe those features have something to do with the quality of the wine.
I created the variable fixed sulfur dioxide by subtracting free sulfur dioxide from total sulfur dioxide. I also combined quality categories into a new binary variable quality.bin. In this variable ‘high’ is assigned to wines with quality 7 or 8 and ‘low’ is assigned to all other wines.
Several of the distributions were right-skewed. I tried a few transformations (logarithmic, cube root, power) on some of them, but the shapes of the distributions did not improve. In the end I used the features as they are. (With hindsight, classification models could have benefited from normalization.)
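For reference, this is roughly what trying such transformations, and the normalization mentioned in hindsight, could look like. It is a sketch on synthetic data: `x` is a made-up right-skewed sample, not one of the wine features, and `skewness()` is a small helper defined here rather than a base R function. On the real features the transformations did not help, as noted above.

```r
set.seed(7)

# Synthetic right-skewed sample (an assumption, standing in for a feature).
x <- rlnorm(1000, meanlog = 0, sdlog = 0.8)

# Helper: moment-based sample skewness (0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Skewness before and after a few common transformations.
skews <- c(
  raw       = skewness(x),
  log10     = skewness(log10(x)),
  cube_root = skewness(x^(1 / 3)),
  sqrt      = skewness(sqrt(x))
)
print(round(skews, 2))

# Normalization: centre to mean 0 and scale to standard deviation 1.
z <- as.numeric(scale(x))
print(c(mean = mean(z), sd = sd(z)))
```

Scaling does not change the shape of a distribution, but it puts features on a common scale, which matters for distance-based models such as k nearest neighbors.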
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## fixed.sulfur.dioxide -0.07814929 0.097033939 0.06677604
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## fixed.sulfur.dioxide 0.174529035 0.055479649 0.425148917
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## fixed.sulfur.dioxide 0.95768634 0.09513464 -0.10805328
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
## fixed.sulfur.dioxide 0.032244043 -0.22320257 -0.20546298
## fixed.sulfur.dioxide
## fixed.acidity -0.07814929
## volatile.acidity 0.09703394
## citric.acid 0.06677604
## residual.sugar 0.17452903
## chlorides 0.05547965
## free.sulfur.dioxide 0.42514892
## total.sulfur.dioxide 0.95768634
## density 0.09513464
## pH -0.10805328
## sulphates 0.03224404
## alcohol -0.22320257
## quality -0.20546298
## fixed.sulfur.dioxide 1.00000000
## wine[, 14]: Low
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.160 Min. :0.0000 Min. : 0.900
## 1st Qu.: 7.100 1st Qu.:0.420 1st Qu.:0.0825 1st Qu.: 1.900
## Median : 7.800 Median :0.540 Median :0.2400 Median : 2.200
## Mean : 8.237 Mean :0.547 Mean :0.2544 Mean : 2.512
## 3rd Qu.: 9.100 3rd Qu.:0.650 3rd Qu.:0.4000 3rd Qu.: 2.600
## Max. :15.900 Max. :1.580 Max. :1.0000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.03400 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07100 1st Qu.: 8.00 1st Qu.: 23.00
## Median :0.08000 Median :14.00 Median : 39.50
## Mean :0.08928 Mean :16.17 Mean : 48.29
## 3rd Qu.:0.09100 3rd Qu.:22.00 3rd Qu.: 65.00
## Max. :0.61100 Max. :72.00 Max. :165.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9958 1st Qu.:3.210 1st Qu.:0.5400 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6000 Median :10.00
## Mean :0.9969 Mean :3.315 Mean :0.6448 Mean :10.25
## 3rd Qu.:0.9979 3rd Qu.:3.410 3rd Qu.:0.7000 3rd Qu.:10.90
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## fixed.sulfur.dioxide
## Min. : 3.00
## 1st Qu.: 12.00
## Median : 23.00
## Mean : 32.11
## 3rd Qu.: 42.00
## Max. :128.00
## --------------------------------------------------------
## wine[, 14]: High
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.900 Min. :0.1200 Min. :0.0000 Min. :1.200
## 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000 1st Qu.:2.000
## Median : 8.700 Median :0.3700 Median :0.4000 Median :2.300
## Mean : 8.847 Mean :0.4055 Mean :0.3765 Mean :2.709
## 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900 3rd Qu.:2.700
## Max. :15.600 Max. :0.9150 Max. :0.7600 Max. :8.900
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 3.00 Min. : 7.00
## 1st Qu.:0.06200 1st Qu.: 6.00 1st Qu.: 17.00
## Median :0.07300 Median :11.00 Median : 27.00
## Mean :0.07591 Mean :13.98 Mean : 34.89
## 3rd Qu.:0.08500 3rd Qu.:18.00 3rd Qu.: 43.00
## Max. :0.35800 Max. :54.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9906 Min. :2.880 Min. :0.3900 Min. : 9.20
## 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500 1st Qu.:10.80
## Median :0.9957 Median :3.270 Median :0.7400 Median :11.60
## Mean :0.9960 Mean :3.289 Mean :0.7435 Mean :11.52
## 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200 3rd Qu.:12.20
## Max. :1.0032 Max. :3.780 Max. :1.3600 Max. :14.00
## fixed.sulfur.dioxide
## Min. : 4.00
## 1st Qu.: 9.00
## Median : 14.00
## Mean : 20.91
## 3rd Qu.: 22.00
## Max. :251.50
Volatile acidity, citric acid, sulphates and alcohol have moderate correlations with wine quality. These features also have noticeably different means between low and high quality wines.
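The two comparisons above (correlation with quality, and group means by quality category) can be sketched as follows. The data frame `wine_head` is hypothetical and holds only the first ten values of each column as printed in the `str()` output, so the numbers it produces are not the ones reported for the full dataset.

```r
# First ten observations, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine_head$quality.bin <- factor(
  ifelse(wine_head$quality >= 7, "High", "Low"),
  levels = c("Low", "High")
)

features <- c("volatile.acidity", "sulphates", "alcohol")

# Pearson correlation of each feature with the numeric quality score.
cors <- cor(wine_head[, features], wine_head$quality)
print(round(cors, 2))

# Mean of each feature within the Low and High groups.
group_means <- aggregate(
  cbind(volatile.acidity, sulphates, alcohol) ~ quality.bin,
  data = wine_head, FUN = mean
)
print(group_means)
```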
No clear trends here. Poor and good wines seem to have higher fixed acidity, but on the other hand there are only a few data points on them, so the effect does not feel very trustworthy.
Comparing only two quality categories reveals that actually high quality wines tend to have higher fixed acidity. Combining quality categories is starting to look like a good idea.
There is a clear decreasing trend with volatile acidity when the wine quality increases.
With the increasing quality the distribution of volatile acidity moves to left and gets narrower.
The lower the volatile acidity, the more likely the wine is to be of high quality. Looks like about 0.38 volatile acidity is the sweet spot for red wines.
The very best wines tend to have higher amounts of citric acid.
Interesting! On average good wines tend to have a lot of citric acid, but the density plot reveals the picture is more complex. There seem to be three kinds of wines regarding citric acid: low (close to 0), medium (~0.25) and high (~0.4) amounts of citric acid. Good wines have either a little or a lot of citric acid, while other wines can have any amount of it.
Nothing interesting going on here.
Move along, nothing to see here.
Average quality wines seem to have a little more free sulfur dioxide on average, but this does not help much in differentiating high quality wines from others.
Total sulfur dioxide is a better indicator of whether a wine is good or bad.
High quality wines seem to lie along a line where the amount of total sulfur dioxide compared to free sulfur dioxide is low.
The pattern is not very clear, though.
Getting better…
Fixed sulfur dioxide is an even better discriminator than total sulfur dioxide! Good wines have low amounts of fixed sulfur dioxide. There are two extreme outliers.
Looks like a low amount of fixed sulfur dioxide is a prerequisite but not a guarantee for high wine quality.
Higher quality wines seem to have lower density. They also have more alcohol, which could cause the correlation. It is probably a good idea to explore how different things affect the density of the wine.
Better wines tend to have slightly lower pH, perhaps because better wines often have higher fixed acidity. The pattern is not very clear though.
Higher amounts of sulphates are associated with higher quality, but there are many outliers in average quality wines that muddy the relationship.
The pattern with alcohol is a little bit U-shaped. The worst quality wines tend to have more alcohol than average wines, and then the better than average wines have increasing amounts of alcohol.
With lower resolution the pattern becomes clearer. Better wines tend to have higher amounts of alcohol.
Wines with more than 12 % alcohol are likely to have high quality, and wines with less than 10 % alcohol are likely poor.
Many of the features are associated with density, which makes sense. Fixed acidity and alcohol seem to have the strongest association. pH has strong correlations with acidity measures, so its association with density is likely to be a result of that.
The higher the acidity, the lower the pH. Surprisingly, higher volatile acidity has a weak correlation with higher pH. Maybe volatile acids are “escaping” from the wine?
Fixed and volatile acidity do not have much to do with each other, but citric acid has to do with both of them! Citric acid has positive correlation with fixed acidity and negative correlation with volatile acidity. These relationships look somewhat nonlinear.
Looks like fixed acidity is linearly related to citric acid to some power, perhaps 4.
Here the relationship looks most linear when citric acid values are squared, but the approximation is very rough.
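One way to compare candidate exponents less by eye is to fit a linear model on each power of citric acid and compare the fits. The sketch below uses synthetic data generated with a known exponent of 4, purely to show the mechanics; the coefficients, noise level and sample size are all assumptions and say nothing about the wine data itself.

```r
set.seed(1)

# Synthetic data: fixed acidity depends linearly on citric acid to the 4th power.
citric.acid   <- runif(300, min = 0, max = 1)
fixed.acidity <- 7 + 6 * citric.acid^4 + rnorm(300, sd = 0.3)

# Fit fixed.acidity ~ citric.acid^p for a few exponents and keep R squared.
powers <- c(1, 2, 4)
r2 <- sapply(powers, function(p) {
  fit <- lm(fixed.acidity ~ I(citric.acid^p))
  summary(fit)$r.squared
})
names(r2) <- powers

print(round(r2, 3))
best_power <- names(r2)[which.max(r2)]
print(best_power)
```

On real data the differences in R squared between neighbouring exponents can be small, so a plot of the residuals is still worth a look.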
Merging quality categories to just high (7 or 8) and low (6 or below) turned out to be helpful in clarifying the differences between wines. In summary, high quality wines tend to have relatively:
In addition to relationships between physical measures and quality I investigated the composition of acidity in more detail, because two acidity measures correlated with quality, and they interact with each other. Interestingly volatile acidity has negative correlation and fixed acidity has positive correlation with citric acid, but volatile and fixed acidity do not correlate much with each other. The relationships seem to be linear in some power of citric acid, perhaps around 2 (volatile acidity) and 4 (fixed acidity). I also looked at the composition of density, which is likely a result of other measured physical properties.
The strongest correlation I found was between total sulfur dioxide and fixed sulfur dioxide, but the correlation is a result of how the variable was created. After that fixed acidity and pH have the highest correlation (-0.68). Other similarly strong correlations include:
However, the most interesting correlation is between alcohol and quality (0.48). High quality wines tend to have a lot of alcohol.
Plotting the wines based on their acidity measures reveals two clusters of high-quality wines:
Although there is overlap, many low quality wines could already be identified from this plot: wines presented with red dots above the black line and grey dots below it are likely to be of poor quality (the location of the line is approximate and only for illustration).
Good quality wines form a rather tight cluster. Again it is possible to identify many poor quality wines visually: any grey wine and all wines above the black line are likely to be of poor quality. Classification algorithms could probably do a good job at identifying high-quality wines (quality 7 or 8) using the following features:
## K nearest neighbors predictions:
##
## prediction_1 Low High
## Low 391 40
## High 26 23
## [1] 0.86
## Support vector machine predictions:
##
## prediction_2 Low High
## Low 408 50
## High 9 13
## [1] 0.88
## Random forest predictions:
##
## prediction_3 Low High
## Low 405 32
## High 12 31
## [1] 0.91
Indeed, k nearest neighbors, support vector machine and random forest all work pretty well even without any optimization. In this case random forest has the best performance, achieving 91 % classification accuracy on the test set.
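The accuracies can be recomputed from the confusion matrices printed above. This sketch assumes that rows are predictions and columns are the true classes (the column sums, 417 Low and 63 High, are identical across the three tables, which is consistent with that reading). The helper `make_cm()` and the other names are made up for illustration.

```r
# Hypothetical helper: 2x2 confusion matrix, rows = predicted, columns = actual.
make_cm <- function(values) {
  matrix(values, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("Low", "High"),
                         actual    = c("Low", "High")))
}

# Cell counts copied from the output above.
cms <- list(
  knn = make_cm(c(391, 40, 26, 23)),
  svm = make_cm(c(408, 50,  9, 13)),
  rf  = make_cm(c(405, 32, 12, 31))
)

# Accuracy: share of test wines on the diagonal.
accuracy <- sapply(cms, function(cm) sum(diag(cm)) / sum(cm))

# Recall for the High class: share of truly High wines that were found.
recall_high <- sapply(cms, function(cm) cm["High", "High"] / sum(cm[, "High"]))

# Accuracy of always predicting the majority class (Low).
baseline <- sum(cms$knn[, "Low"]) / sum(cms$knn)

print(accuracy)
print(recall_high)
print(baseline)
```

Because only 63 of the 480 test wines are High under this reading, always predicting Low would already be right about 87 % of the time, so the recall on the High class is a useful companion figure to accuracy.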
Combination of different acidity measures (fixed acidity, volatile acidity and citric acid) turned out to be useful in visually differentiating between high and low quality wines, as did the combination of fixed sulfur dioxide, sulphates and alcohol. The clustering looked much tighter than I had expected based on the bivariate comparisons. After seeing these plots it was not a surprise that classification models performed well at predicting wine quality.
The biggest surprise was the interaction between sulphates and fixed sulfur dioxide. Neither of them was a strong candidate as a predictor, but together they collected high-quality wines in a tight cluster.
I tried three classification models on the promising features identified during the exploratory analysis. All of them worked well “out of the box”, achieving around 90 % accuracy. In this case random forest had the best performance with 91 % accuracy on the test set. This means that based on six physical measures of the wine, the random forest model can correctly predict nine times out of ten whether the wine is of high quality. The performance of the models could likely be improved further with a little optimization. For instance, the k-value in the k-nearest neighbors model was pulled from a hat, and the other models were fitted with default parameters.
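Instead of pulling k from a hat, a few candidate values can be compared on held-out data. The sketch below is not the code used in this analysis: the data are synthetic, the 70/30 split is an assumption, and `knn_predict()` is a small base R helper written here for illustration rather than a function from a package.

```r
set.seed(42)

# Synthetic two-class data standing in for the wine features (an assumption).
n <- 200
x <- rbind(matrix(rnorm(n, mean = 0), ncol = 2),
           matrix(rnorm(n, mean = 2), ncol = 2))
y <- factor(rep(c("Low", "High"), each = n / 2), levels = c("Low", "High"))

# Hold out 30 % of the rows for validation.
test_idx <- sample(nrow(x), size = round(0.3 * nrow(x)))
x_train <- x[-test_idx, , drop = FALSE]
y_train <- y[-test_idx]
x_test  <- x[test_idx, , drop = FALSE]
y_test  <- y[test_idx]

# Hypothetical helper: k nearest neighbours with Euclidean distance
# and a majority vote among the k closest training points.
knn_predict <- function(train, labels, test, k) {
  pred <- apply(test, 1, function(row) {
    d <- sqrt(colSums((t(train) - row)^2))   # distance to every training row
    nearest <- order(d)[seq_len(k)]          # indices of the k closest rows
    names(which.max(table(labels[nearest]))) # most frequent label among them
  })
  factor(pred, levels = levels(labels))
}

# Hold-out accuracy for a few odd values of k (odd values avoid tied votes).
ks <- c(1, 3, 5, 7, 9, 15)
acc <- sapply(ks, function(k) {
  mean(knn_predict(x_train, y_train, x_test, k) == y_test)
})
names(acc) <- ks

print(acc)
best_k <- ks[which.max(acc)]
print(best_k)
```

A single hold-out split gives a noisy estimate, so repeating the comparison over several splits (cross-validation) would be the more careful version of the same idea.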
Distribution of quality scores for red wines is bell-shaped. As many of the categories have relatively few observations, and the main interest is in differentiating good wines from the rest, it makes sense to combine categories. Only a minority of wines is of high quality.
Amount of alcohol has the strongest association with wine quality. Better wines tend to have more alcohol.
Three acidity measures and amounts of fixed sulfur dioxide, sulphates, and alcohol differentiate high-quality wines from low-quality wines rather well. Dashed lines help illustrate the borders of distinct clusters. In the upper plot, wines represented by red dots above the dashed line and by grey dots below it are likely to have low quality. In the lower plot wines represented with grey dots below the dashed line and all wines above the line are likely to have low quality.
The dataset I explored contained physical measurements of 1599 red wines, along with subjective quality scores. I started by investigating the distributions of individual variables. After that I identified promising correlations between the variables in an effort to find a set of features that could be used to predict wine quality. Instead of the full quality scale I was only interested in differentiating good wines (score 7 or 8) from the rest. Initially the dataset felt confusing and it didn’t look like there were any interesting patterns, but systematically plotting comparisons of variables slowly revealed many interesting relationships. During the analysis the largest surprise was that some of the variables that did not look like very good predictors alone worked very well when combined. In the end I used six identified features to build three classification models. Random forest performed best, achieving 91 % classification accuracy on the test set. The performance of the models could be improved by optimization and possibly by adding new features. Some of the less promising features could still contain useful information.
I made a couple of detours during the analysis by investigating the relationships of density and pH with other variables they had high correlations with, and by looking into interactions of different acidity measures in detail. In the end I did not get anything useful out of this exploration, except perhaps the decision to leave density and pH out of further analysis, because the information they contain seemed likely to be already accounted for by other variables.